 breadth and depth


R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

arXiv.org Artificial Intelligence

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.
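To make the idea of query composition concrete, here is a minimal sketch of how independent problems might be chained into one long-horizon prompt in which each problem depends on the previous answer. The `Problem` class, the `{prev}` placeholder, and the `compose` helper are hypothetical names chosen for illustration; the abstract does not specify the authors' actual composition procedure, so this should be read as one possible reading and not as the R-HORIZON implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Problem:
    # Question text with a "{prev}" slot for the value it depends on (assumed format).
    template: str
    # Reference solver: maps the previous answer to this problem's answer.
    solve: Callable[[int], int]


def compose(problems: List[Problem], seed: int) -> Tuple[str, List[int]]:
    """Chain problems into one prompt; return the prompt and reference answers."""
    lines, answers, prev = [], [], seed
    for i, p in enumerate(problems, start=1):
        # The first problem sees the literal seed; later ones refer back symbolically,
        # so the model has to solve the problems in order.
        ref = str(seed) if i == 1 else f"[answer to Problem {i - 1}]"
        lines.append(f"Problem {i}: " + p.template.format(prev=ref))
        prev = p.solve(prev)
        answers.append(prev)
    return "\n".join(lines), answers


problems = [
    Problem("Compute {prev} + 4.", lambda x: x + 4),
    Problem("Multiply {prev} by 3.", lambda x: x * 3),
    Problem("Subtract 5 from {prev}.", lambda x: x - 5),
]
prompt, answers = compose(problems, seed=2)
print(prompt)
print(answers)
```

Because every later problem refers to an earlier answer, a single early mistake propagates through the chain, which is one way a composed query can stress reasoning over a longer horizon than its individual parts.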


"The Diagram is like Guardrails": Structuring GenAI-assisted Hypotheses Exploration with an Interactive Shared Representation

arXiv.org Artificial Intelligence

Data analysis encompasses a spectrum of tasks, from high-level conceptual reasoning to lower-level execution. While AI-powered tools increasingly support execution tasks, there remains a need for intelligent assistance in conceptual tasks. This paper investigates the design of an ordered node-link tree interface augmented with AI-generated information hints and visualizations, as a potential shared representation for hypothesis exploration. Through a design probe (n=22), participants generated diagrams averaging 21.82 hypotheses. Our findings showed that the node-link diagram acts as "guardrails" for hypothesis exploration, facilitating structured workflows, providing comprehensive overviews, and enabling efficient backtracking. The AI-generated information hints, particularly visualizations, aided users in transforming abstract ideas into data-backed concepts while reducing cognitive load. We further discuss how node-link diagrams can support both parallel exploration and iterative refinement in hypothesis formulation, potentially enhancing the breadth and depth of human-AI collaborative data analysis.


Impact of Data Breadth and Depth on Performance of Siamese Neural Network Model: Experiments with Three Keystroke Dynamic Datasets

arXiv.org Machine Learning

Deep learning models, such as Siamese Neural Networks (SNNs), have shown great potential in capturing the intricate patterns in behavioral data. However, the impacts of dataset breadth (i.e., the number of subjects) and depth (e.g., the number of training samples per subject) on the performance of these models are often informally assumed and remain under-explored. To this end, we have conducted extensive experiments using the concepts of "feature space" and "density" to guide our study and gain a deeper understanding of the impact of dataset breadth and depth on three publicly available keystroke datasets (Aalto, CMU and Clarkson II). By varying the number of training subjects, the number of samples per subject, the amount of data in each sample, and the number of triplets used in training, we found that, when feasible, increasing dataset breadth yields a well-trained model that effectively captures more inter-subject variability. In contrast, we find that the extent of depth's impact depends on the nature of the dataset. Free-text datasets are influenced by all of the depth-wise factors: inadequate samples per subject, sequence length, training triplets, or gallery sample size may each lead to an under-trained model. Fixed-text datasets are less affected by these factors, and as such make it easier to create a well-trained model. These findings shed light on the importance of dataset breadth and depth in training deep learning models for behavioral biometrics and provide valuable insights for designing more effective authentication systems.
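The breadth and depth controls described in this abstract can be sketched as a simple subsampling step: pick a fixed number of subjects (breadth), then a fixed number of samples for each (depth). The `subsample` function, the dictionary layout, and the toy data below are assumptions made for illustration only; they are not taken from the paper or from the Aalto, CMU, or Clarkson II datasets.

```python
import random
from typing import Dict, List, Sequence


def subsample(
    data: Dict[str, List[Sequence[float]]],
    n_subjects: int,
    n_samples: int,
    seed: int = 0,
) -> Dict[str, List[Sequence[float]]]:
    """Return a subset with n_subjects subjects (breadth) and n_samples each (depth)."""
    rng = random.Random(seed)
    # Only subjects with enough samples can satisfy the requested depth.
    eligible = sorted(s for s, xs in data.items() if len(xs) >= n_samples)
    if len(eligible) < n_subjects:
        raise ValueError("not enough subjects with the requested number of samples")
    chosen = rng.sample(eligible, n_subjects)
    return {s: rng.sample(data[s], n_samples) for s in chosen}


# Toy data: 5 hypothetical subjects, 10 samples each, 3 timing features per sample.
gen = random.Random(42)
toy = {
    f"subject_{i}": [[gen.random() for _ in range(3)] for _ in range(10)]
    for i in range(5)
}

subset = subsample(toy, n_subjects=3, n_samples=4)
print({s: len(xs) for s, xs in subset.items()})
```

Sweeping `n_subjects` while holding `n_samples` fixed (and vice versa) gives the kind of grid over breadth and depth on which a model could then be trained and compared.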


BPO: Towards Balanced Preference Optimization between Knowledge Breadth and Depth in Alignment

arXiv.org Artificial Intelligence

Reinforcement Learning with Human Feedback (RLHF) is the key to the success of large language models (LLMs) in recent years. In this work, we first introduce the concepts of knowledge breadth and knowledge depth, which measure the comprehensiveness and depth of an LLM or knowledge source respectively. We reveal that the imbalance in the number of prompts and responses can lead to a potential disparity in breadth and depth learning within alignment tuning datasets by showing that even a simple uniform method for balancing the number of instructions and responses can lead to significant improvements. Building on this, we further propose Balanced Preference Optimization (BPO), designed to dynamically augment the knowledge depth of each sample. BPO is motivated by the observation that the usefulness of knowledge varies across samples, necessitating tailored learning of knowledge depth. To achieve this, we introduce gradient-based clustering, estimating the knowledge informativeness and usefulness of each augmented sample based on the model's optimization direction. Our experimental results across various benchmarks demonstrate that BPO outperforms other baseline methods in alignment tuning while maintaining training efficiency. Furthermore, we conduct a detailed analysis of each component of BPO, providing guidelines for future research in preference data optimization.


An Objective Laboratory Protocol for Evaluating Cognition of Non-Human Systems Against Human Cognition

arXiv.org Artificial Intelligence

It is virtually impossible to tease apart human capabilities from human cultural and other background knowledge, so this is necessary to provide an objective point of comparison against humans. Furthermore, a comprehensive understanding of human background knowledge, sufficient to not only recall but apply that knowledge, tests the cognitive capabilities essential to the human kind of understanding. I have recommended that human respondents be drawn from broad populations to ensure that this cultural knowledge is least-common-denominator rather than esoteric. The graders might be able to tell that they are scoring a non-human subject system. Difficulties with the Turing Test have demonstrated that this is probably not an issue. It is a relatively easy task to fool humans into thinking they are interacting with a human, even without human-level cognitive capabilities. Mimicking human interaction styles, though again not necessarily a goal of the subject system, should not be difficult for a system with cognition that is comparable to that of humans. Nevertheless, the reason the protocol attempts to disguise which respondents are human or non-human is not because this contributes to the evaluation, but merely to avoid implicit bias in scoring. All the test questions are raster images - does this mean the system has to do handwriting recognition?


Seattle Seahawks Select AWS as Its Cloud, Machine Learning, and Artificial Intelligence Provider

#artificialintelligence

In addition to moving the vast majority of its infrastructure to AWS, the National Football League (NFL) team will use the breadth and depth of AWS's services, including compute, storage, database, analytics, and ML to drive deep analysis of game footage to inform game strategy, improve operational efficiencies, and accelerate decision-making to advance team performance game-to-game. The Seahawks will combine the weekly NFL Next Gen Stats player tracking data, which tracks the position of the ball and every player 10 times per second, with its own player and club data to develop custom analytics and proprietary statistics. The Seattle Seahawks are relying on AWS's unmatched portfolio of services to discover actionable outcomes from its vast amount of player, team, and business data, enabling them to continue to compete at a championship caliber level. The Seahawks are building a data lake on Amazon Simple Storage Service (Amazon S3) that will combine team stats and NFL data, such as Next Gen Stats player tracking, player health and wellness data, and scouting information to provide deeper visibility into player capabilities, as well as give the coaching staff a single, real-time view of player and team performance. By applying AWS analytics services to the data, the Seahawks will be able to quickly uncover insights to better evaluate talent and develop game plans that take advantage of the team's strengths.


The 2018 Locus Awards Present The Breadth And Depth Of Science Fiction And Fantasy

Forbes - Tech

'The Stone Sky' by N.K. Jemisin won the 2018 Locus Award for the best fantasy novel. The 2018 Locus Awards were announced this past weekend. Unlike the Hugo and Nebula Awards, voting for the Locus Awards is open to everyone. Membership in an organization or even a subscription to Locus is not required. Subscribers do have an advantage, however, because their votes count double.


4 Approaches To Natural Language Processing & Understanding - TOPBOTS

@machinelearnbot

In 1971, Terry Winograd wrote the SHRDLU program while completing his PhD at MIT. SHRDLU features a world of toy blocks where the computer translates human commands into physical actions, such as "move the red pyramid next to the blue cube." To succeed in such tasks, the computer must build up semantic knowledge iteratively, a process Winograd discovered was brittle and limited. The rise of chatbots and voice-activated technologies has renewed fervor in natural language processing (NLP) and natural language understanding (NLU) techniques that can produce satisfying human-computer dialogs. Unfortunately, academic breakthroughs have not yet translated to improved user experiences, with Gizmodo writer Darren Orf declaring Messenger chatbots "frustrating and useless" and Facebook admitting a 70% failure rate for their highly anticipated conversational assistant M. Nevertheless, researchers forge ahead with new plans of attack, occasionally revisiting the same tactics and principles Winograd tried in the 1970s. OpenAI recently leveraged reinforcement learning to teach agents to design their own language by "dropping them into a set of simple worlds, giving them the ability to communicate, and then giving them goals that can be best achieved by communicating with other agents."


The Top 18 Security Predictions for 2018

#artificialintelligence

Abraham Lincoln once said, "The best thing about the future is that it comes one day at a time." Winston Churchill once said, "If you're going through hell, keep going." And, "Never, never, never give up." As we look back at top cyber stories and security trends in 2017, these wise words from fearless leaders who have gone before us certainly apply to cybersecurity and the new 21st-century challenges confronting our world in 2018. Last year we started with, "You ain't seen nothing yet!"